ETC5521 Diving Deeper into Data Exploration: Assignment 2

As per Monash’s integrity rules, these solutions are not to be shared beyond this class.

Author

Prof. Di Cook

Published

August 12, 2024

🎯 Goal

The assignment is designed to explore single variables and pairs of variables, and to conduct appropriate tests for the significance of patterns. The assignment represents 20% of your final grade for ETC5521. This is an individual assignment.

📌 Guidelines

  1. Accept the GitHub Classroom Assignment provided in Moodle using a GitHub Classroom compatible web browser. This should generate a private GitHub repository that can be found at https://github.com/etc5521-2024. Your GitHub assignment 2 repo should contain the files assign02.html, README.md, assign02-submission.qmd, assignment.css, etc5521-assignment2.Rproj, .gitignore, and any data files generated from your work that are needed to render your solution file. Code should be available in the report, but folded.

  2. Answer each question in the assign02-submission.qmd in the repo.

  3. For the final submission, render assign02-submission.qmd, which will contain your answers. Make sure to provide the link to the script of any Generative AI conversation you employed in arriving at your solution. Note that marks are allocated for the overall grammar and structure of your final report.

  4. Leave all of your files in your GitHub repo for marking. We will check your git commit history. You should have contributions to the repo with consistent commits over time. (Note: nothing needs to be submitted to Moodle.)

  5. You are expected to develop your solutions by yourself, without discussing any details with other class members or other friends or contacts. You can ask for clarifications from the teaching team, and we encourage you to attend consultations to get assistance as needed. As a Monash student you are expected to adhere to Monash’s academic integrity policy, and to the details on the use of Generative AI as described on this unit’s Moodle assessment overview.

Deadlines:

Due date            Turn in
11:45pm Mon Aug 19  Assignment 2 Repo on GitHub has been created
11:45pm Mon Aug 26  Final solutions available on repo

🛠️ Exercises

Question 1: Becoming Tukey

ABC News ran a story on Mon 5 Aug 2024 titled Researchers argue for new, fairer Olympic rankings, wishing for a fairer way to measure Olympic success. The official way to measure success is to rank countries by the number of gold medals. (Note that, when the USA was the top country based on total medals, most US sources ranked countries by total count, not by gold, but this is a side story. Now, several weeks in, the USA is also leading on gold medals.)

However, by this system, the top-ranked nations rarely change. The question is posed whether there are fairer ways to measure success. The article discusses using medals per capita, and a new measure described in the motivating Journal of Sports Analytics called the “U Index”.

Your job is to:

  1. Download the latest Olympic Medal tallies (should be the final tallies by the time the assignment is due to be submitted). There are various ways to get this data, in particular you are encouraged to script it with web scraping using rvest. Report the top 10 countries by gold medal count.
  2. Compile the current population for each country, compute a per capita medal count, and report the top 10 countries.
  3. Invent five other ways to calculate and rank the Olympic performance of competing countries using this data, possibly augmented with new data. Explain what you think your calculation is measuring about performance.
  1. The best place to get the data is from Kaggle, but at the time of writing the assignment it was only available on dynamically updated websites, like Wikipedia. This is the code that can be used to scrape the data. Generally when you do this you want to save a local copy of the data, because the website might change, and the code may not work for long.
# Packages needed for scraping and wrangling
library(rvest)
library(tidyverse)

# Using wikipedia as the medal tally source
# The downside of this is that it is missing the zeros
# This data needs to be saved because the web site is always changing
pg <- read_html("https://en.wikipedia.org/wiki/2024_Summer_Olympics_medal_table")
tbls <- pg |> html_elements("table")

# Extract the fourth table on the page as a tibble
medals <- tbls[[4]] |> 
  html_table()

# By number of gold
medals |>
  slice_head(n=10)

medals <- medals |>
  filter(!str_detect(NOC, "Totals")) |>
  mutate(NOC = str_remove(NOC, "‡")) |>
  mutate(NOC = str_remove(NOC, "\\*")) |>
  mutate(Total = as.numeric(Total)) |>
  mutate(p_gold = Gold/Total) 

# Note, elected to keep individual athletes, and refugee medal winners
write_csv(medals, file="data/medals.csv")
# By number of gold
medals <- read_csv(here("assignments/data/medals.csv"))
medals |>
  select(NOC, Gold) |>
  slice_head(n=10)
# A tibble: 10 × 2
   NOC            Gold
   <chr>         <dbl>
 1 United States    40
 2 China            40
 3 Japan            20
 4 Australia        18
 5 France           16
 6 Netherlands      15
 7 Great Britain    14
 8 South Korea      13
 9 Italy            12
10 Germany          12
  2. Population data can be scraped from Wikipedia also, with the code below.
# Best to save this data also
pop_tbl <- read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population") |> 
  html_elements("table") 
pop_tbl <- pop_tbl[1][[1]]

pop <- pop_tbl |>
  html_table()

pop <- pop |>
  select(Location, Population, `% ofworld`) |>
  filter(!is.na(Population)) |>
  mutate(Population = str_remove_all(Population, ","),
         `% ofworld` = str_remove(`% ofworld`, "%")) |>
  mutate(Population = as.numeric(Population),
         `% ofworld` = as.numeric(`% ofworld`))

# Need to recode some countries, based on the 
# checks made in the next few lines
pop <- pop |> 
  mutate(Location = fct_recode(Location,
    "Great Britain" = "United Kingdom",
    "Hong Kong" = "Hong Kong (China)",
    "Chinese Taipei" = "Taiwan",
    "Puerto Rico" = "Puerto Rico (US)"))

write_csv(pop, file="data/pop.csv")
pop <- read_csv(here("assignments/data/pop.csv"))
# Check country matching
medals_pop <- medals |> 
  filter(!(NOC %in% c("Totals (80 entries)", 
                      "Individual Neutral Athletes[A]",
                      "Refugee Olympic Team", 
                      "Individual Neutral Athletes[A][B]"))) |>
  mutate(NOC = str_remove(NOC, "\\*")) |>
  left_join(pop, by=c("NOC"="Location"))

#medals_pop |> 
#  filter(is.na(Population)) |>
#  pull(NOC) # All matched after name changes

medals_pop <- medals_pop |>
  mutate(percap = Gold/Population*1000000)

medals_pop |>
  arrange(desc(percap)) |>
  select(NOC, percap) |>
  slice_head(n=10)
# A tibble: 10 × 2
   NOC         percap
   <chr>        <dbl>
 1 Dominica    14.8  
 2 Saint Lucia  5.43 
 3 New Zealand  1.87 
 4 Bahrain      1.27 
 5 Slovenia     0.941
 6 Netherlands  0.834
 7 Georgia      0.812
 8 Ireland      0.757
 9 Norway       0.718
10 Australia    0.658

There are many possible choices of metrics to create, including these from the submissions:

  • proportion of gold to total medals
  • use a country’s GDP or socioeconomic index instead of population
  • total medals: so countries without top performances feature
  • using a point system: which gives a weighted rank
  • by number of athletes: teams with a small number of high-performing athletes rank at the top
  • by number of athletes, but calibrated by the number of events each participated in
  • by number of athletes, but calibrated by the number of team members
  • medals per GDP
  • improvement over the 2020 Olympics: Total 2024 - Total 2020. Note this should be divided by Total 2020 to assess the relative increase; any country that scores a lot of medals would likely score more if only the difference were used.
  • gender balance: (male total - female total) / total
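As an illustration of one of these, the relative-improvement metric could be computed as sketched below. This is only a sketch: it assumes a hypothetical medals_2020 tibble (with NOC and Total columns) compiled separately from the Tokyo 2020 tally, which is not part of the data scraped above.

```r
# Sketch of the relative-improvement metric.
# Assumes a hypothetical `medals_2020` tibble with columns
# NOC and Total, compiled separately from the Tokyo 2020 tally.
medals |>
  left_join(medals_2020 |> rename(Total_2020 = Total),
            by = "NOC") |>
  mutate(improve = (Total - Total_2020) / Total_2020) |>
  arrange(desc(improve)) |>
  select(NOC, improve) |>
  slice_head(n = 10)
```

Dividing by Total_2020 means small countries that doubled a modest tally rank alongside large countries that added many medals.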
# By proportion of gold
medals |> 
  arrange(desc(p_gold)) |>
  select(NOC, p_gold) |>
  slice_head(n=10)
# A tibble: 10 × 2
   NOC            p_gold
   <chr>           <dbl>
 1 Dominica        1    
 2 Pakistan        1    
 3 Slovenia        0.667
 4 Algeria         0.667
 5 Indonesia       0.667
 6 Uzbekistan      0.615
 7 Serbia          0.6  
 8 Czech Republic  0.6  
 9 Ireland         0.571
10 New Zealand     0.5  
# Removing the countries with only 1 medal
medals |> 
  filter(Total > 1) |>
  arrange(desc(p_gold)) |>
  select(NOC, p_gold) |>
  slice_head(n=10)
# A tibble: 10 × 2
   NOC            p_gold
   <chr>           <dbl>
 1 Slovenia        0.667
 2 Algeria         0.667
 3 Indonesia       0.667
 4 Uzbekistan      0.615
 5 Serbia          0.6  
 6 Czech Republic  0.6  
 7 Ireland         0.571
 8 New Zealand     0.5  
 9 Norway          0.5  
10 Bahrain         0.5  
# Removing the countries with less than 5 medals
medals |> 
  filter(Total > 4) |>
  arrange(desc(p_gold)) |>
  select(NOC, p_gold) |>
  slice_head(n=10)
# A tibble: 10 × 2
   NOC            p_gold
   <chr>           <dbl>
 1 Uzbekistan      0.615
 2 Serbia          0.6  
 3 Czech Republic  0.6  
 4 Ireland         0.571
 5 New Zealand     0.5  
 6 Norway          0.5  
 7 Japan           0.444
 8 Netherlands     0.441
 9 China           0.440
10 Georgia         0.429

Now, a nice way to summarise the differences between the metrics is a line plot, focusing just on the top ranks.

medals_pop <- medals_pop |>
  mutate(wgt_tot = Bronze + 2*Silver + 3*Gold,
         Rank = as.numeric(Rank)) |>
  mutate(r2 = rank(desc(percap)), r3 = rank(desc(p_gold)), r4 = rank(desc(wgt_tot)))

p <- medals_pop |>
  pivot_longer(c(Rank, r2, r3, r4), 
               names_to = "metric", 
               values_to = "rank") |>
  ggplot(aes(x=metric, y=rank, group=NOC)) + 
    geom_line()
ggplotly(p) # interactive version; requires the plotly package

Here, each line corresponds to a country. We can see that:

  • Country rank from gold medal count is very similar to method 4 (weighted medal count). There are a few countries that go down dramatically in rank and a few that go up.
  • The rankings from methods 2 (per capita) and 3 (proportion gold) are quite different, both from each other and from the other two.

MARKING:

  • 1 pt for (a), including explaining how the medal data was obtained.
  • 1 pt for (b), including explaining how the population data was obtained.
  • 2.5 pts in total for the five metrics in (c).
  • 1.5 pts for explaining how the metrics are calculated and what they measure in (c).

Question 2: Chatfield-style IDA

Download the latest data from the World Development Indicators. Use the last two decades of records (2004-2022), and all the countries. Select just the variables that have the keyword “fuel” in the explanation.

Your task is to prepare the data, by doing the initial data analysis, needed to answer the question

Is access to clean fuels for cooking related to fuel imports or exports?

You are NOT expected to answer the question. The IDA will likely involve steps such as:

  • filtering countries with few values
  • removing variables with few values
  • imputing missings
  • screening data

Your script needs to completely document all of the IDA conducted, with comments describing the code and decisions made. The raw data and valid data sets need to be added to your repo.

In your report, provide an outline of the IDA conducted, and a few summary statistics and plots of your “valid” data set.

The first step is to read the data and tidy it. This is a good point to save the tidied data and other data products, such as the data dictionary (explanation of variables) and the full country names with their codes.

# Packages needed: readxl, here, tidyverse, janitor
library(readxl)
library(tidyverse)

# The n_max removes summary text from the bottom of the spreadsheet
wdi <- read_xlsx(here::here("data/P_Data_Extract_From_World_Development_Indicators.xlsx"), n_max = 4788)
wdi_cnt <- wdi |>
  select(`Country Name`, `Country Code`) |>
  distinct()
wdi_vars <- wdi |>
  select(`Series Name`, `Series Code`) |>
  distinct()
# Note 2023 is missing
wdi_tidy <- wdi |>
  select(`Country Code`, `Series Code`, `2004 [YR2004]`:`2022 [YR2022]`) |>
    rename_all(janitor::make_clean_names) |>
  pivot_longer(x2004_yr2004:x2022_yr2022,
               names_to = "year", 
               values_to = "value") |>
  mutate(year = as.numeric(str_sub(year, 2, 5)),
         value = as.numeric(value)) |>
  pivot_wider(names_from = series_code,
              values_from = value)
save(wdi_tidy, file="data/wdi_tidy.rda")
save(wdi_cnt, file="data/wdi_cnt.rda")
save(wdi_vars, file="data/wdi_vars.rda")
load("data/wdi_tidy.rda")
load("data/wdi_cnt.rda")
load("data/wdi_vars.rda")

Note that there are 266 countries, and 18 series.

The next step is to check the missingness structure. The process is to remove variables and countries while retaining as much data as possible: remove variables that are mostly missing, then countries that are mostly missing, and iterate until a reasonably large set remains whose missing values can be imputed well.

Here we start with examining missing value summaries.

library(naniar) # for vis_miss, gg_miss_var, miss_var_summary

load("data/wdi_tidy.rda")
vis_miss(wdi_tidy, sort_miss = TRUE)

gg_miss_var(wdi_tidy, show_pct=TRUE)

wdi_miss <- wdi_tidy |> 
  group_by(country_code) |>
  miss_var_summary()

wdi_cnt_miss <- wdi_miss |>
  group_by(country_code) |>
  summarise(pct_miss = mean(pct_miss,
                            na.rm=TRUE)) |>
  arrange(desc(pct_miss))
ggplot(wdi_cnt_miss, 
        aes(x=fct_inorder(country_code),
            y=pct_miss)) +
  geom_point() +
  coord_flip() +
  xlab("") + ylab("% missing") +
  theme(axis.text.y = element_blank())

Now we start the iterative filtering of variables and countries. Checking the missingness structure after each removal.

keep <- wdi_cnt_miss |>
  filter(pct_miss < 42) |> # Iterative: 85, 60
  pull(country_code)
# There are still 181 countries

wdi_tidy <- wdi_tidy |>
  filter(country_code %in% keep)

vis_miss(wdi_tidy, sort_miss = TRUE)
gg_miss_var(wdi_tidy, show_pct=TRUE)

# Only five variables have sufficient data
wdi_clean <- wdi_tidy |>
  select(country_code, year,
         EG.CFT.ACCS.ZS:EG.CFT.ACCS.UR.ZS,
         TX.VAL.FUEL.ZS.UN,
         TM.VAL.FUEL.ZS.UN)

wdi_clean_miss <- wdi_clean |> 
  group_by(country_code) |>
  miss_var_summary()

wdi_clean_cnt_miss <- wdi_clean_miss |>
  group_by(country_code) |>
  summarise(pct_miss = mean(pct_miss,
                            na.rm=TRUE)) |>
  arrange(desc(pct_miss))
keep <- wdi_clean_cnt_miss |>
  filter(pct_miss < 10) |> 
  pull(country_code)

wdi_clean <- wdi_clean |>
  filter(country_code %in% keep)

wdi_clean |>
  pivot_longer(EG.CFT.ACCS.ZS:TM.VAL.FUEL.ZS.UN, 
               names_to="series",
               values_to="value") |>
  ggplot(aes(x=year, y=value)) +
    geom_miss_point() +
  facet_wrap(~series, ncol=3, scales="free_y") +
  theme(aspect.ratio=0.5, 
        legend.position = "none")

# Make variable names nicer
wdi_clean <- wdi_clean |>
  pivot_wider(names_from = "series", values_from = "value") |>
  rename(clean_fuels_all = EG.CFT.ACCS.ZS,
         clean_fuels_rural = EG.CFT.ACCS.RU.ZS,
         clean_fuels_urban = EG.CFT.ACCS.UR.ZS,
         fuel_exports = TX.VAL.FUEL.ZS.UN,
         fuel_imports = TM.VAL.FUEL.ZS.UN)

write_csv(wdi_clean, file="data/wdi_clean.csv")
# Check the clean data
wdi_clean <- read_csv("data/wdi_clean.csv")

Note that there are now 161 countries, and 5 series.

# Impute missings with a moving average (na_ma from the imputeTS package)
library(imputeTS)

wdi_valid <- wdi_clean |>
    pivot_longer(clean_fuels_all:fuel_imports, 
               names_to="series",
               values_to="value") |>
  group_by(country_code, series) |>
  mutate(value = na_ma(value)) |>
  ungroup()

wdi_valid |>
    ggplot(aes(x=year, y=value,
               group=country_code)) +
    geom_line() +
  facet_wrap(~series, ncol=3, scales="free_y") +
  theme(aspect.ratio=0.5, 
        legend.position = "none")

# Data screening: plot samples of countries
wdi_valid |> 
  pivot_wider(names_from = series,
              values_from = value) |>
  filter(country_code %in% c("AUS", "BEL", "BRA", "NGA")) |>
  ggplot(aes(x=clean_fuels_all,
           y=fuel_imports)) +
  geom_point() +
  facet_wrap(~country_code, scales = "free")

We can see that some countries have 100% clean fuels available to all.

# Might also need to filter out countries with constant values 
# over time on these variables. (The check is shown here, using 
# the long-form imputed data, which has the series column.)
wdi_const <- wdi_valid |>
  group_by(country_code, series) |>
  summarise(value = sd(value), .groups = "drop")
drop <- wdi_const |>
  filter(value < 0.001) |>
  distinct(country_code) |>
  pull(country_code)

wdi_valid <- wdi_valid |>
  filter(!(country_code %in% drop)) 

Save the final valid data set to a file, for later analysis.

write_csv(wdi_valid, "data/wdi_valid.csv")
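For the report, a few summary statistics of the valid data can be produced directly. A minimal sketch, assuming wdi_valid is in the long form (country_code, year, series, value) produced by the imputation step:

```r
# Numerical summary of each series in the valid data
wdi_valid |>
  group_by(series) |>
  summarise(min = min(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))
```

This gives a one-row-per-series overview that pairs well with the faceted time series plots shown earlier.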

MARKING:

  • 1 pt for having correct data from WDI database
  • 3 pts for missing value summaries (heatmap, %variable, %country)
  • 2 pts for imputing appropriately using na_ma
  • 1 pt for finishing with roughly 150 countries and the five variables
  • 1 pt for summarising the final data saved.

Question 3: Check the data collection

This paper describes an experiment to determine if a choropleth map or a hexagon tile map is better for displaying different types of statistical measurements spatially, especially focused on Australia.

The experimental data is available in the file data/DAT_HexmapPilotData_V1_20191115.csv.

The variables in this data set are:

  • group: sets provided to subject, A or B
  • trend: type of pattern simulated in the data, NW-SE, all cities, three cities
  • location: which plot in the lineup of 20 was the data plot
  • type: plot type, “Geography” is a choropleth map, “Hexagons” is a hexagon tile map
  • choice: which plot from the lineup the subject picked
  • reason: their explanation for the choice
  • certainty: how confident were they in their choice
  • time: timestamp to allow examination of time taken
  • order: each subject saw 12 lineups in this order
  • replicate: different simulated data generated
  • id: unique id for each lineup used
  • detect: was calculated from whether the plot chosen matched the data plot

Your job is to check whether the data collected is completely covering the experimental design, or whether there were any problems with the data. Your answer should include information, and summary statistics and/or plots that:

  • identify which columns of the data match the factors in the experiment.
  • check the number of measurements collected for each of the treatments.
  • Factors:
    • trend: NW-SE, all cities, three cities
    • type: Geography, Hexagons
  • Measured variables:
    • choice: which plot the subject picked
    • reason: their explanation for the choice
    • certainty: how confident were they in their choice
    • time: timestamp to allow examination of time taken
    • detect: was calculated from whether the plot chosen matched the data plot

The group (A or B) was used for the assignment of people to treatments, to make sure that each subject saw either the Geography or Hexagon plot of any data, but not both.

There were three replicates for each trend type.

The location of the data plot was randomised in each lineup. This is not a factor, just a precaution preventing people from expecting the data plot to be in a particular location.

Randomising the order that the plots were shown to subjects was a precaution against an effect occurring based on when the plots were viewed.

The steps to checking the design are:

  1. compare counts by factors
  2. compare counts by treatments
  3. discover the experimental design by pivoting on type or group, to see that the same data was not shown to a subject twice
  4. check that the order in which lineups were seen has no particular patterns
expt <- read_csv("https://raw.githubusercontent.com/srkobakian/experiment/master/data/DAT_HexmapPilotData_V1_20191115.csv")
expt |> count(trend)
# A tibble: 3 × 2
  trend            n
  <chr>        <int>
1 NW-SE           84
2 all cities      84
3 three cities    84
expt |> count(type)
# A tibble: 2 × 2
  type          n
  <chr>     <int>
1 Geography   126
2 Hexagons    126
expt |>
  count(trend, type) |>
  knitr::kable()
trend         type       n
NW-SE         Geography  42
NW-SE         Hexagons   42
all cities    Geography  42
all cities    Hexagons   42
three cities  Geography  42
three cities  Hexagons   42
expt |>
  count(group, trend, type) |>
  knitr::kable()
group  trend         type       n
A      NW-SE         Geography  20
A      NW-SE         Hexagons   20
A      all cities    Geography  20
A      all cities    Hexagons   20
A      three cities  Geography  20
A      three cities  Hexagons   20
B      NW-SE         Geography  22
B      NW-SE         Hexagons   22
B      all cities    Geography  22
B      all cities    Hexagons   22
B      three cities  Geography  22
B      three cities  Hexagons   22
expt |>
  count(group, replicate, trend, type) |>
  pivot_wider(names_from = type, 
              values_from = n, 
              values_fill = 0) |>
  knitr::kable()
group  replicate  trend         Hexagons  Geography
A      1          all cities    10        0
A      2          all cities    0         10
A      3          all cities    10        0
A      4          all cities    0         10
A      5          NW-SE         0         10
A      6          NW-SE         0         10
A      7          NW-SE         10        0
A      8          NW-SE         10        0
A      9          three cities  0         10
A      10         three cities  0         10
A      11         three cities  10        0
A      12         three cities  10        0
B      1          all cities    0         11
B      2          all cities    11        0
B      3          all cities    0         11
B      4          all cities    11        0
B      5          NW-SE         11        0
B      6          NW-SE         11        0
B      7          NW-SE         0         11
B      8          NW-SE         0         11
B      9          three cities  11        0
B      10         three cities  11        0
B      11         three cities  0         11
B      12         three cities  0         11
expt |>
  count(group, replicate, trend, type) |>
  pivot_wider(names_from = group, 
              values_from = n, 
              values_fill = 0) |>
  knitr::kable()
replicate  trend         type       A   B
1          all cities    Hexagons   10  0
2          all cities    Geography  10  0
3          all cities    Hexagons   10  0
4          all cities    Geography  10  0
5          NW-SE         Geography  10  0
6          NW-SE         Geography  10  0
7          NW-SE         Hexagons   10  0
8          NW-SE         Hexagons   10  0
9          three cities  Geography  10  0
10         three cities  Geography  10  0
11         three cities  Hexagons   10  0
12         three cities  Hexagons   10  0
1          all cities    Geography  0   11
2          all cities    Hexagons   0   11
3          all cities    Geography  0   11
4          all cities    Hexagons   0   11
5          NW-SE         Hexagons   0   11
6          NW-SE         Hexagons   0   11
7          NW-SE         Geography  0   11
8          NW-SE         Geography  0   11
9          three cities  Hexagons   0   11
10         three cities  Hexagons   0   11
11         three cities  Geography  0   11
12         three cities  Geography  0   11
# Check the order
expt |>
  ggplot(aes(x=order, y=id,
             fill=type)) +
  geom_tile() +
  theme(aspect.ratio = 0.5,
        axis.text = element_blank(),
        axis.title = element_blank())

expt |>
  ggplot(aes(x=order, y=id,
             fill=trend)) +
  geom_tile() +
  theme(aspect.ratio = 0.5,
        axis.text = element_blank(),
        axis.title = element_blank())

In summary, it’s a very nicely organised experiment. The treatments were seen relatively equally among subjects. The order varied beautifully from one person to another. The sets A and B ensured that no-one saw the same data twice.

The only small issues were that slightly different numbers of subjects saw sets A and B, and that there may have been a small concentration of the hexagon type being seen after the geography type.
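The group imbalance can be confirmed with a quick count. Since each subject completed 12 lineups, dividing the row count per group by 12 gives the number of subjects; a sketch:

```r
# Rows per group / 12 lineups per subject = number of subjects
expt |>
  count(group) |>
  mutate(subjects = n / 12)
# From the tables above we expect 10 subjects in A and 11 in B
```

This is the same imbalance visible in the per-cell counts of 10 versus 11 in the tables above.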

MARKING:

  • 2 points for appropriately listing factors, and measured variables
  • 2 points for appropriate numerical tabulation of factors
  • 2 points for the more sophisticated checking, which reveals the assignment of treatments to subjects and the order in which lineups were provided to them.

Generative AI analysis

In this part, we would like you to actively discuss how generative AI helped with your answers to the assignment questions, and where or how it was mistaken or misleading.

You need to provide a link that makes the full script of your conversation with any generative AI tool accessible to the teaching staff. You should not use a paid service, as the freely available systems will be sufficiently helpful.

The link to my use of ChatGPT for help on this project is XXX.

Marks

Part                            Points
Q1                              6
Q2                              8
Q3                              6
GitHub Repo                     -2
Generative AI Analysis          -3
Formatting, Spelling & Grammar  -3

Note that the negative marks for “Generative AI Analysis” and “Formatting, Spelling & Grammar” correspond to reductions in scores. You can lose up to 3 marks for poor use of Generative AI: for example, no use, basic questions only, no link to the script, or no acknowledgement despite clear use. You can lose up to 3 marks for poorly formatted and written answers. Two marks will be deducted if you have NOT accepted the assignment and created your own repo by 11:45pm Mon Aug 19.

Here are some guidelines for code:

  • Split chunks that process data from ones that plot data
  • It is dangerous to run scraping code on each render because web sites often change. The Wikipedia page has changed since the assignment was made available. It is always best to scrape once and save the result.
  • Label the chunks - helps to locate errors
  • Comment the code so you can remember what each is supposed to do when you come back to it.
  • Learn how to name things well, see https://github.com/jennybc/how-to-name-files